R basics & time series datasets
options(timeout = 6000)
if (!require("openxlsx")) {
  install.packages(c("tidyverse", "openxlsx", "reticulate", "quantmod", "feasts",
                     "tsibble", "fable", "crypto2", "ggsci", "WDI"))
}
1 Introduction
1.1 About
1.2 Structure of the crash course
- Introduction to R basics & datasets + summary statistics
- Theory 1: randomness, stationarity, unit roots, random walk
- Theory 2: autocorrelation, ARMA models, predictive regressions
- Advanced models (VAR, GARCH and neural networks) and prediction
2 Packages
There are currently more than 23,000 packages on CRAN.
Many more exist on GitHub; those, however, are not vetted/verified.
2.1 Installation
Just run the installation chunk at the top of this document (the one calling install.packages()).
2.2 Loading
library() is the equivalent of “import …” in Python.
library(tidyverse) # THE library for data science
library(openxlsx) # A cool package to deal with Excel files/formats
library(quantmod) # Package for financial data extraction
library(tsibble) # TS with dataframes framework
library(fable) # Package for time-series models & predictions
library(feasts) # Package for time-series analysis
library(crypto2) # Package to access crypto data
library(ggsci) # For cool plot palettes & colors
library(WDI) # For World Bank data
3 Things to know with R
3.1 Data structures, assigning & indexing
The equal sign “=” looks like it goes both ways; arrows don’t: they point from the value to the name!
a <- 6
a
[1] 6
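In fact, the arrow can point either way; a quick sketch (base R only, the variable names below are made up for illustration):

```r
a <- 6   # right-to-left assignment
6 -> b   # left-to-right assignment: the arrow points from the value to the name
b2 = 6   # "=" also assigns (left only) at the top level
a == b   # TRUE: all three objects hold the same value
```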
Vectors of integers…
1:12
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
12:1
 [1] 12 11 10  9  8  7  6  5  4  3  2  1
Sequences of equally-spaced numbers.
seq(0, 1, by = 0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Custom vectors. This syntax is strange (compared to Python/MATLAB), but super IMPORTANT in R.
Essentially, you wrap the elements in the simple c() function.
c(2, 3, 5, 8, 13)
[1]  2  3  5  8 13
Matrices.
First method: stacking vectors (or smaller matrices).
rbind(1:5, 2:6)
     [,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 3 4 5 6
cbind(1:5, 2:6)
     [,1] [,2]
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
Second method: filling with values.
matrix(1:12, nrow = 3, byrow = T)
     [,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
Indexing. It starts at one!
M <- matrix(1:72, nrow = 6)
M
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 7 13 19 25 31 37 43 49 55 61 67
[2,] 2 8 14 20 26 32 38 44 50 56 62 68
[3,] 3 9 15 21 27 33 39 45 51 57 63 69
[4,] 4 10 16 22 28 34 40 46 52 58 64 70
[5,] 5 11 17 23 29 35 41 47 53 59 65 71
[6,] 6 12 18 24 30 36 42 48 54 60 66 72
M[2:5, 2:5] # rows first, then columns, ALWAYS!
     [,1] [,2] [,3] [,4]
[1,] 8 14 20 26
[2,] 9 15 21 27
[3,] 10 16 22 28
[4,] 11 17 23 29
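One more indexing trick worth knowing (standard base R, not shown above): negative indices drop rows or columns instead of selecting them.

```r
M <- matrix(1:72, nrow = 6) # same matrix as above
M2 <- M[-1, -1]             # drop the first row and the first column
dim(M2)                     # 5 rows, 11 columns remain
```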
Dataframes: we can build them column-by-column (their tidyverse variant is called a “tibble”).
Most of the time, they are simply imported from an external source (e.g., an Excel spreadsheet).
Below, we build one from scratch (this is often useful).
n <- 12
data.frame(t = 1:n, x = rnorm(n), y = log(1:n))
| t | x | y |
|---|---|---|
| 1 | 1.3095721 | 0.0000000 |
| 2 | -0.4957079 | 0.6931472 |
| 3 | 0.1053022 | 1.0986123 |
| 4 | 0.1794539 | 1.3862944 |
| 5 | 0.0283460 | 1.6094379 |
| 6 | -1.2470486 | 1.7917595 |
| 7 | -0.5993684 | 1.9459101 |
| 8 | -0.0833312 | 2.0794415 |
| 9 | -1.2958980 | 2.1972246 |
| 10 | 0.1116494 | 2.3025851 |
| 11 | 0.6284734 | 2.3978953 |
| 12 | -0.9536080 | 2.4849066 |
3.2 From type to type
Number → character.
as.character(4)
[1] "4"
Character → number.
as.numeric("4.5")
[1] 4.5
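When the conversion is impossible, R returns NA (with a warning) rather than an error; a quick check:

```r
as.numeric("4.5")                     # 4.5
suppressWarnings(as.numeric("four"))  # NA: not a parsable number
```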
Text → date. Don’t forget the ISO-8601 standard…
as.Date("13/02/2005", format = "%d/%m/%Y")
[1] "2005-02-13"
Dates again.
as.Date("09/11/01", format = "%m/%d/%y")
[1] "2001-09-11"
Factors.
as.factor(c("Large", "Medium", "Small", "Small", "Medium", "Large"))
[1] Large  Medium Small  Small  Medium Large 
Levels: Large Medium Small
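Note that the levels are sorted alphabetically by default (hence Large, Medium, Small above). When the categories have a natural order, you can impose it explicitly; a small sketch:

```r
sizes <- factor(c("Large", "Medium", "Small", "Small", "Medium", "Large"),
                levels = c("Small", "Medium", "Large"), # explicit ordering
                ordered = TRUE)
sizes[1] > sizes[2]  # TRUE: "Large" ranks above "Medium" under this ordering
```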
3.3 The tidyverse + piping
The tidyverse is an ecosystem of packages that are incredibly useful in data science tasks.
In particular:
- dplyr and tidyr for wrangling
- ggplot for plotting
- stringr for string management
- lubridate for date management
For this section, we need data. Below, we import macroeconomic information from the World Bank API.
Technically speaking, this data is of panel type, but it’s great to work with.
wb_data <- WDI( # World Bank data
indicator = c(
"pop" = "SP.POP.TOTL", # Population
"pop_growth" = "SP.POP.GROW", # Population growth
"gdp_percap" = "NY.GDP.PCAP.CD", # GDP per capita
"gdp" = "NY.GDP.MKTP.CD", # Gross Domestic Product (GDP)
"R_D" = "GB.XPD.RSDV.GD.ZS", # R&D (%GDP)
"high_tech_exp" = "TX.VAL.TECH.MF.ZS", # High tech exports (%)
"inflation" = "FP.CPI.TOTL.ZG", # Inflation rate
"educ_spending" = "SE.XPD.TOTL.GD.ZS" # Education spending (%GDP)
),
extra = TRUE,
start = 1960,
  end = 2024) # |> filter(lastupdated == max(lastupdated))
Filtering observations ( = operate on rows).
filter(wb_data[,1:8], year > 2012, country == "India")
| country | iso2c | iso3c | year | status | lastupdated | pop | pop_growth |
|---|---|---|---|---|---|---|---|
| India | IN | IND | 2014 | 2025-10-07 | 1312277191 | 1.2612903 | |
| India | IN | IND | 2015 | 2025-10-07 | 1328024498 | 1.1928556 | |
| India | IN | IND | 2013 | 2025-10-07 | 1295829511 | 1.3327043 | |
| India | IN | IND | 2016 | 2025-10-07 | 1343944296 | 1.1916297 | |
| India | IN | IND | 2017 | 2025-10-07 | 1359657400 | 1.1623961 | |
| India | IN | IND | 2024 | 2025-10-07 | 1450935791 | 0.8907065 | |
| India | IN | IND | 2018 | 2025-10-07 | 1374659064 | 1.0972991 | |
| India | IN | IND | 2023 | 2025-10-07 | 1438069596 | 0.8832895 | |
| India | IN | IND | 2022 | 2025-10-07 | 1425423212 | 0.7902005 | |
| India | IN | IND | 2021 | 2025-10-07 | 1414203896 | 0.8226482 | |
| India | IN | IND | 2020 | 2025-10-07 | 1402617695 | 0.9734386 | |
| India | IN | IND | 2019 | 2025-10-07 | 1389030312 | 1.0400140 |
Ok so we see the ordering is a mess (look at the years)…
Let’s sort that out.
wb_data <- arrange(wb_data, country, year)
tail(wb_data, 5)
| country | iso2c | iso3c | year | status | lastupdated | pop | pop_growth | gdp_percap | gdp | R_D | high_tech_exp | inflation | educ_spending | region | capital | longitude | latitude | income | lending | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17286 | Zimbabwe | ZW | ZWE | 2020 | 2025-10-07 | 15526888 | 1.659353 | 1730.454 | 26868564055 | NA | 2.383868 | 557.20182 | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| 17287 | Zimbabwe | ZW | ZWE | 2021 | 2025-10-07 | 15797210 | 1.726011 | 1724.387 | 27240507842 | NA | 1.053740 | 98.54611 | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| 17288 | Zimbabwe | ZW | ZWE | 2022 | 2025-10-07 | 16069056 | 1.706209 | 2040.547 | 32789657378 | NA | 1.476931 | 104.70517 | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| 17289 | Zimbabwe | ZW | ZWE | 2023 | 2025-10-07 | 16340822 | 1.677096 | 2156.034 | 35231369343 | NA | 1.942188 | NA | 0.3847713 | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| 17290 | Zimbabwe | ZW | ZWE | 2024 | 2025-10-07 | 16634373 | 1.780482 | 2656.409 | 44187704410 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend |
For descending sorting:
arrange(wb_data, desc(country), year) |> head(7)
| country | iso2c | iso3c | year | status | lastupdated | pop | pop_growth | gdp_percap | gdp | R_D | high_tech_exp | inflation | educ_spending | region | capital | longitude | latitude | income | lending |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zimbabwe | ZW | ZWE | 1960 | 2025-10-07 | 3809389 | NA | 276.4198 | 1052990485 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| Zimbabwe | ZW | ZWE | 1961 | 2025-10-07 | 3930401 | 3.127265 | 279.0165 | 1096646688 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| Zimbabwe | ZW | ZWE | 1962 | 2025-10-07 | 4055959 | 3.144570 | 275.5456 | 1117601690 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| Zimbabwe | ZW | ZWE | 1963 | 2025-10-07 | 4185877 | 3.152908 | 277.0057 | 1159511793 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| Zimbabwe | ZW | ZWE | 1964 | 2025-10-07 | 4320006 | 3.154055 | 281.7445 | 1217138098 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| Zimbabwe | ZW | ZWE | 1965 | 2025-10-07 | 4458462 | 3.154707 | 294.1454 | 1311435906 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend | |
| Zimbabwe | ZW | ZWE | 1966 | 2025-10-07 | 4601217 | 3.151697 | 278.5675 | 1281749603 | NA | NA | NA | NA | Sub-Saharan Africa | Harare | 31.0672 | -17.8312 | Lower middle income | Blend |
Before we continue, we want to introduce a wonder of coding efficiency: THE PIPE OPERATOR.
In fact, there are two of them and they are pretty similar: %>% and |>.
The idea is to chain operations: the outcome of what comes before a pipe becomes the input of the pipe.
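A minimal base-R illustration (no package needed): the nested call and the piped chain below are equivalent, but the pipe reads left to right.

```r
sqrt(sum(1:8))          # nested version: must be read inside-out
1:8 |> sum() |> sqrt()  # piped version: 1:8 feeds sum(), whose result feeds sqrt()
```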
We replicate the filter code above but via piping.
wb_data |> filter(country == "India", year > 2020)
| country | iso2c | iso3c | year | status | lastupdated | pop | pop_growth | gdp_percap | gdp | R_D | high_tech_exp | inflation | educ_spending | region | capital | longitude | latitude | income | lending |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| India | IN | IND | 2021 | 2025-10-07 | 1414203896 | 0.8226482 | 2239.614 | 3.167271e+12 | NA | 10.21256 | 5.131407 | 4.629500 | South Asia | New Delhi | 77.225 | 28.6353 | Lower middle income | IBRD | |
| India | IN | IND | 2022 | 2025-10-07 | 1425423212 | 0.7902005 | 2347.448 | 3.346107e+12 | NA | 12.68228 | 6.699034 | 4.098658 | South Asia | New Delhi | 77.225 | 28.6353 | Lower middle income | IBRD | |
| India | IN | IND | 2023 | 2025-10-07 | 1438069596 | 0.8832895 | 2530.120 | 3.638489e+12 | NA | 14.93435 | 5.649143 | NA | South Asia | New Delhi | 77.225 | 28.6353 | Lower middle income | IBRD | |
| India | IN | IND | 2024 | 2025-10-07 | 1450935791 | 0.8907065 | 2696.664 | 3.912686e+12 | NA | NA | 4.953036 | NA | South Asia | New Delhi | 77.225 | 28.6353 | Lower middle income | IBRD |
And we check that, indeed, the data is now sorted; this time with pipes.
Henceforth, we will always resort to piping…
Select columns ( = operate on columns).
wb_data |> select(country, year, pop_growth, gdp) |> tail()
| country | year | pop_growth | gdp | |
|---|---|---|---|---|
| 17285 | Zimbabwe | 2019 | 1.563533 | 25715657177 |
| 17286 | Zimbabwe | 2020 | 1.659353 | 26868564055 |
| 17287 | Zimbabwe | 2021 | 1.726011 | 27240507842 |
| 17288 | Zimbabwe | 2022 | 1.706209 | 32789657378 |
| 17289 | Zimbabwe | 2023 | 1.677096 | 35231369343 |
| 17290 | Zimbabwe | 2024 | 1.780482 | 44187704410 |
Create new columns with mutate().
wb_data |>
mutate(total_educ_spending = educ_spending * gdp) |>
select(country, year, total_educ_spending) |>
filter(is.finite(total_educ_spending)) |>
  tail()
| country | year | total_educ_spending | |
|---|---|---|---|
| 6336 | Zimbabwe | 2012 | 103890204643 |
| 6337 | Zimbabwe | 2013 | 114469274331 |
| 6338 | Zimbabwe | 2014 | 119670494331 |
| 6339 | Zimbabwe | 2017 | 102322600059 |
| 6340 | Zimbabwe | 2018 | 70036650843 |
| 6341 | Zimbabwe | 2023 | 13556019685 |
3.4 Tidy vs messy data
Tidy data seems simple:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
But it’s not always easy to spot or understand at first.
A counter-example can help:
data.frame(Year = c(1970, 1990, 2010),
France = c(52, 59, 65),
Germany = c(61, 80, 82),
           UK = c(56, 57, 63))
| Year | France | Germany | UK |
|---|---|---|---|
| 1970 | 52 | 61 | 56 |
| 1990 | 59 | 80 | 57 |
| 2010 | 65 | 82 | 63 |
“France” is not a variable name. It’s a value for the variable “country”…
Luckily, we have a tool to turn a dataset from messy to tidy: pivot_longer().
All you need to do is:
- Determine which columns (variables) to pivot;
- Pick a name for the new column that will feature the column names;
- Choose another name for the new column that will store the values.
An example below.
data.frame(Year = c(1970, 1990, 2010),
France = c(52, 59, 65),
Germany = c(61, 80, 82),
UK = c(56, 57, 63)) |>
  pivot_longer(-Year, names_to = "Country", values_to = "Population")
| Year | Country | Population |
|---|---|---|
| 1970 | France | 52 |
| 1970 | Germany | 61 |
| 1970 | UK | 56 |
| 1990 | France | 59 |
| 1990 | Germany | 80 |
| 1990 | UK | 57 |
| 2010 | France | 65 |
| 2010 | Germany | 82 |
| 2010 | UK | 63 |
Same information, but structured/presented differently.
What happened? See below…
Another example on a transposed version (countries in rows and years in columns):
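A sketch of that transposed case (the small table below is typed in by hand to mirror the example above): this time we pivot the year columns rather than the country columns.

```r
library(tidyr)
messy <- data.frame(Country = c("France", "Germany", "UK"),
                    "1970" = c(52, 61, 56),
                    "1990" = c(59, 80, 57),
                    "2010" = c(65, 82, 63),
                    check.names = FALSE) # keep "1970" etc. as column names
messy |>
  pivot_longer(-Country, names_to = "Year", values_to = "Population")
```

Mind one subtlety: the former column names come back as characters, so a further as.integer(Year) is needed if you want numeric years.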
3.5 Pivot tables
These beasts can be incredibly useful. They require two steps:
- Determine the dimensions (categorical columns) along which to perform the analysis: this is done with group_by();
- Define the key metrics to compute: this is done with summarize().
wb_data |>
filter(year == 2023, region != "Aggregates") |>
group_by(region) |>
  summarise(total_population = sum(pop))
| region | total_population |
|---|---|
| East Asia & Pacific | 2361106057 |
| Europe & Central Asia | 925640799 |
| Latin America & Caribbean | 654413871 |
| Middle East & North Africa | 508842198 |
| North America | 376954413 |
| South Asia | 1951539835 |
| Sub-Saharan Africa | 1241543738 |
Below, “na.rm = T” means that NA data will be removed before applying the function.
wb_data |>
filter(region != "Aggregates") |>
group_by(region, year) |>
summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T)) |>
  head(8)
| region | year | avg_gdp_percap |
|---|---|---|
| East Asia & Pacific | 1960 | 522.1182 |
| East Asia & Pacific | 1961 | 535.4375 |
| East Asia & Pacific | 1962 | 546.7194 |
| East Asia & Pacific | 1963 | 594.2096 |
| East Asia & Pacific | 1964 | 636.9282 |
| East Asia & Pacific | 1965 | 785.9105 |
| East Asia & Pacific | 1966 | 839.3606 |
| East Asia & Pacific | 1967 | 795.2839 |
A last example.
wb_data |>
group_by(income) |>
  summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T))
| income | avg_gdp_percap |
|---|---|
| Aggregates | 5285.0354 |
| High income | 22472.9538 |
| Low income | 433.7686 |
| Lower middle income | 1150.8568 |
| Not classified | 4101.0849 |
| Upper middle income | 3475.4895 |
| NA | 4405.7019 |
What is a low income country?
wb_data |>
filter(income == "Low income") |>
select(country) |>
  distinct()
| country |
|---|
| Afghanistan |
| Burkina Faso |
| Burundi |
| Central African Republic |
| Chad |
| Congo, Dem. Rep. |
| Eritrea |
| Ethiopia |
| Gambia, The |
| Guinea-Bissau |
| Korea, Dem. People’s Rep. |
| Liberia |
| Madagascar |
| Malawi |
| Mali |
| Mozambique |
| Niger |
| Rwanda |
| Sierra Leone |
| South Sudan |
| Sudan |
| Syrian Arab Republic |
| Togo |
| Uganda |
| Yemen, Rep. |
3.6 Plots
In “ggplot”, the GG stands for grammar of graphics, a very neat way to think of plots.
It decomposes graphs into specific elements (layers), see illustration below.
Of the above, only the bottom three are indispensable; the upper ones are for customization.
Geometries are plot types; see the poster below (link on GitHub).
To find inspiration, you can also type “chart chooser” on Google…
Let’s illustrate this with a few examples. Below, we show information along 4 dimensions:
- x-axis
- y-axis
- color
- shape of point
wb_data |>
filter(region != "Aggregates") |>
group_by(region, income) |>
summarise(avg_wealth = mean(gdp_percap, na.rm = T),
avg_educ = mean(educ_spending, na.rm = T)) |>
na.omit() |>
ggplot(aes(x = avg_educ, y = avg_wealth, color = region, shape = income)) +
geom_point(size = 5) +
  xlab("Education spending") + ylab("Wealth") +
  theme_classic()
Other potential dimensions include:
- alpha (transparency)
- size (for points)
- fill (the inside of shapes/rectangles)
- linetype (for lines)
- linewidth (for lines)
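A small sketch with made-up data (two toy series generated on the fly) showing alpha, linetype and linewidth in action:

```r
library(ggplot2)
toy <- data.frame(t      = rep(1:20, 2),
                  y      = c(cumsum(rnorm(20)), cumsum(rnorm(20))), # two random walks
                  series = rep(c("A", "B"), each = 20))
p <- ggplot(toy, aes(x = t, y = y, linetype = series)) + # linetype encodes the series
  geom_line(linewidth = 1.1, alpha = 0.6) +              # thicker, semi-transparent lines
  theme_classic()
p
```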
Look at the %in% operator below… much handier than chaining OR (|) conditions.
wb_data |>
filter(country %in% c("France", "Italy", "United Kingdom")) |>
ggplot(aes(x = year, y = pop/10^6, color = country)) +
geom_line() + geom_point() +
theme_classic() +
theme(axis.title = element_blank(),
text = element_text(size = 13),
legend.text = element_text(size = 13),
legend.title = element_text(face = "bold", size = 15),
        legend.position = c(0.2,0.8))
A closer look at the syntax:
Both elements, the aesthetics (aes) and the geom type, are crucial.
Also, mind the “+” that separates the layers: it is not a pipe!
Another example, using a recycled pivot table.
na.omit() removes rows with missing data.
wb_data |>
group_by(income) |>
summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T)) |>
na.omit() |>
ggplot(aes(y = reorder(income, avg_gdp_percap), x = avg_gdp_percap)) +
geom_col() +
xlab("Average GDP per capita") + ylab("") +
  theme_light()
A last one for the road.
wb_data |>
filter(region != "Aggregates") |>
group_by(region, year) |>
summarise(avg_gdp_percap = mean(gdp_percap, na.rm = T)) |>
ggplot(aes(x = year, y = avg_gdp_percap, color = region)) +
geom_line(linewidth = 0.9) +
scale_y_log10() +
theme_classic() +
theme(axis.title = element_blank(),
text = element_text(size = 13)) +
  ggtitle("GDP per capita (log-scale)")
4 Other datasets
4.1 Airline traffic
Let’s have a look at some data. Below, we dive into airline traffic, obtained from Aéroports de Paris.1
Upon inspection, we need to open the third spreadsheet and focus on the first three columns:
- date
- number of passengers for CDG
- number of passengers for ORY
This file was manually edited to avoid errors (duplicated dates in the original sample).
url <- "https://github.com/shokru/coqueret.github.io/raw/refs/heads/master/files/misc/time_series/traffic-sheet.xlsx"
airline <- read.xlsx(url, sheet = 3, startRow = 2)
head(airline)
     X1 Paris.-.Charles.de.Gaulle Paris.-.Orly   Total
1 36526 3223328 1935261 5158589
2 36557 3289676 1942750 5232426
3 36586 3891206 2204640 6095846
4 36617 4221430 2266448 6487878
5 36647 4217758 2203570 6421328
6 36678 4279344 2190218 6469562
Paris.-.Charles.de.Gaulle Paris.-.Orly Total
1 39489 19956 59445
2 38386 19347 57733
3 42049 21286 63335
4 42931 20192 63123
5 43909 20921 64830
6 41966 19331 61297
Next, let’s do a bit of wrangling.
air_data <- airline |> select(1:3) # Keep only the first 3 columns
colnames(air_data) <- c("date", "CDG", "ORY") # Rename these columns
air_data <- air_data |> # Reformat the date
mutate(date = as.Date(as.numeric(date), origin = "1899-12-30")) |>
filter(is.finite(date), is.finite(CDG)) |>
  arrange(date)
To make the most of the {fable} package, we need to embed the dataframe in a tsibble, a kind of strange animal (data format) that the package really likes.
air_data <- air_data |>
mutate(date = yearmonth(date)) |>
distinct(date, .keep_all = T) |>
as_tsibble() |>
  fill_gaps()
Let’s plot all of this!
air_data |>
pivot_longer(-date, names_to = "airport", values_to = "passengers") |>
ggplot(aes(x = date, y = passengers/10^6, color = airport)) + geom_line() +
theme_classic() + ggtitle("Passengers in millions") +
theme(axis.title = element_blank(),
title = element_text(face = "bold"),
        legend.position = c(0.1,0.9))
What are some patterns that we can identify?
air_data %>%
model(classical_decomposition(CDG)) %>%
components() %>%
  autoplot() + theme_light()
A look at seasonality.
air_data |>
as.data.frame() |>
mutate(month = month(date)) |>
group_by(month) |>
summarise(avg_pass = mean(CDG/10^6)) |>
ggplot(aes(x = as.factor(month), y = avg_pass)) + geom_col() +
  xlab("month") + ylab("")
Summer months, as expected, have more passenger rotations.
4.2 Atmospheric CO2 concentration
url <- "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv"
co2 <- read.csv(url, skip = 40)
co2 <- co2 |>
mutate(date = make_date(year = year, month = month, day = 15)) |>
select(date, average, deseasonalized) |>
mutate(date_iso = date,
date = yearmonth(date)) |>
  as_tsibble(index = date)
A sneak peek…
co2 |>
pivot_longer(-c(date, date_iso), names_to = "series", values_to = "value") |>
ggplot(aes(x = date, y = value, color = series)) + geom_line() +
theme_classic() + ggtitle("CO2 concentration (ppm)") +
theme(legend.position = c(0.2,0.8),
text = element_text(size = 15),
legend.title = element_blank(),
        axis.title = element_blank())
How can we explain the oscillations?
Let’s decompose this.
co2 %>%
model(classical_decomposition(average)) %>%
components() %>%
autoplot() +
  theme_light()
4.3 Financial series
4.3.1 Stocks
tickers <- c("AAPL", "BA", "C", "PFE", "WMT", "XOM")
# Tickers: Apple, Boeing, Citigroup, Pfizer, Walmart, Exxon
# Others: , "F", "DIS", "GE", "CVX", "MSFT", "GS"
min_date <- "2000-01-01" # Starting date
max_date <- "2025-10-30" # Ending date
prices <- getSymbols(tickers, src = 'yahoo', # The data comes from Yahoo Finance
from = min_date, # Start date
to = max_date, # End date
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>% # Retrieving the data
reduce(merge) %>% # Merge in one dataframe
`colnames<-`(tickers) # Set the column names
prices |> tail()
             AAPL     BA      C   PFE    WMT    XOM
2025-10-22 258.45 216.59 96.30 24.72 107.14 114.71
2025-10-23 259.58 217.77 96.69 24.67 106.86 115.98
2025-10-24 262.82 221.35 98.78 24.76 106.17 115.39
2025-10-27 268.81 223.00 100.99 24.77 104.47 115.94
2025-10-28 269.00 223.33 101.39 24.50 103.17 115.03
2025-10-29 269.70 213.58 99.12 24.29 102.46 116.45
prices %>%
as.data.frame() %>%
rownames_to_column(var = "Date") %>%
mutate(Date = as.Date(Date)) |>
pivot_longer(-Date, names_to = "Asset",
values_to = "Price") %>%
ggplot(aes(x = Date, y = Price, color = Asset)) + geom_line() +
  facet_wrap(vars(Asset), scales = "free") + theme_light()
Any pattern that you can recognize?
4.3.2 Cryptocurrencies
We will use the {crypto2} package below. If need be, install it.
First, let’s have a look at the list of available coins… which are numerous!
coins <- crypto_list()
You can have a look at the info for the coins via the code below.
c_info <- crypto_info(coin_list = coins, limit = 30)
❯ Scraping crypto info
❯ Processing crypto info
Next, we can download historical quotes.
Symbols have duplicates, so we need to use “slugs”.
coin_symb <- c("bitcoin", "ethereum", "tether", "xrp")
coin_hist <- crypto_history(coins |> dplyr::filter(slug %in% coin_symb),
start_date = "20170101",
                           end_date = "20250925")
❯ Scraping historical crypto data
❯ Processing historical crypto data
coin_hist <- coin_hist |> # Timestamps are at midnight.
  mutate(date = as.Date(as.POSIXct(timestamp, origin="1970-01-01")))
Mind the log-scale!
coin_hist |>
ggplot(aes(x = date, y = close, color = name)) + geom_line() +
scale_y_log10() + theme_bw() + scale_color_d3() +
theme(legend.position = c(0.75,0.5),
axis.title = element_blank(),
        legend.title = element_blank())
Any pattern that you can recognize here, again?
Unfortunately, this kind of data is hard to analyze directly; more on that next time!
5 Descriptive statistics (to be done in class)
Pick a dataset… & explore!
- What is the average rate of increase of CO2 in the atmosphere?
- Using the code below, determine during which month airports see the most passengers.
- Which stock/crypto had the best performance over the sample we downloaded?
- Which country is the richest (GDP per capita), and which one spends the most on R&D, or education?
air_data |> mutate(month = month(date))
6 Wrap-up
The things you need to remember:
- the tidyverse functions: filter, select, arrange, mutate, group_by and summarize. They are incredible tools to analyze data rapidly.
- There are a small number of patterns that are easy to recognize (seasonality, trends). But there are also unpredictable shocks, some small, some very large. It is these shocks that we will try to model henceforth.
Footnotes
See at the bottom of the page, downloaded in October and covering 2000-2025.↩︎